In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("project2.ipynb")

Project 2: Climate Change—Temperatures and Precipitation¶

In this project, you will investigate data on climate change, or the long-term shifts in temperatures and weather patterns!

Advice. Develop your answers incrementally. To perform a complicated task, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, DO NOT reuse the variable names that we use when we grade your answers.

To get started, load datascience, numpy, and matplotlib. Make sure to also run the first cell of this notebook to load otter.

In [2]:
# Run this cell to set up the notebook, but please don't change it.
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')

import warnings
warnings.simplefilter('ignore')

import plotly.graph_objects as go

Part 1: Temperatures¶

In the following analysis, we will investigate one of the 21st century's most prominent issues: climate change. While the details of climate science are beyond the scope of this course, we can start to learn about climate change just by analyzing public records of different cities' temperature and precipitation over time.

We will analyze a collection of historical daily temperature and precipitation measurements from weather stations in 210 U.S. cities. The dataset was compiled by Yuchuan Lai and David Dzombak [1]; a description of the data from the original authors and the data itself is available here.

[1] Lai, Yuchuan; Dzombak, David (2019): Compiled historical daily temperature and precipitation data for selected 210 U.S. cities. Carnegie Mellon University. Dataset.

Part 1, Section 1: Cities¶

Run the following cell to load information about the cities and preview the first few rows.

In [3]:
cities = Table.read_table('city_info.csv', index_col=0)
cities.show(3)
Name ID Lat Lon Stn.Name Stn.stDate Stn.edDate
Lander USW00024021 42.8153 -108.726 LANDER WBO 1892-01-01 1946-05-28
Lander USW00024021 42.8153 -108.726 LANDER HUNT FIELD 1946-05-29 2021-12-31
Cheyenne USW00024018 41.1519 -104.806 CHEYENNE WBO 1871-01-01 1935-08-31

... (458 rows omitted)

The cities table has one row per weather station and the following columns:

  1. "Name": The name of the US city
  2. "ID": The unique identifier for the US city
  3. "Lat": The latitude of the US city (measured in degrees of latitude)
  4. "Lon": The longitude of the US city (measured in degrees of longitude)
  5. "Stn.Name": The name of the weather station in which the data was collected
  6. "Stn.stDate": A string representing the date of the first recording at that particular station
  7. "Stn.edDate": A string representing the date of the last recording at that particular station

The data lists the weather stations at which temperature and precipitation data were collected. Note that although some cities have multiple weather stations, only one is collecting data for that city at any given point in time. Thus, we are able to just focus on the cities themselves.

Question 1.1.1: In the cell below, produce a scatter plot that plots the latitude and longitude of every city in the cities table so that the result places northern cities at the top and western cities at the left.

Note: It's okay to plot the same point multiple times!

Hint: A latitude is the set of horizontal lines that measures distances north or south of the equator. A longitude is the set of vertical lines that measures distances east or west of the prime meridian.

In [4]:
cities.scatter("Lon","Lat")
No description has been provided for this image

These cities are all within the continental U.S., and so the general shape of the U.S. should be visible in your plot. The shape will appear distorted compared to most maps for two reasons: the scatter plot is square even though the U.S. is wider than it is tall, and this scatter plot is an equirectangular projection of the spherical Earth. A geographical map of the same data uses the common Pseudo-Mercator projection.

Note: If this visualization doesn't load for you, please view a version of it online here.

In [5]:
# Just run this cell
fig = go.Figure(data=go.Scattergeo(
        lon = cities.column('Lon'),
        lat = cities.column('Lat'),
        text = cities.column('Name'),
        mode = 'markers'))

fig.update_layout(geo_scope='usa', width=800, height=500, margin={'r': 0, 't': 0, 'l': 0, 'b': 0})
fig.show()

Question 1.1.2 Does it appear that these city locations are sampled uniformly at random from all the locations in the U.S.? Why or why not?

It does, as they seem to be evenly dispersed throughout the map.

Question 1.1.3: Assign num_unique_cities to the number of unique cities that appear in the cities table.

In [6]:
num_unique_cities = len(cities.group('Name').column('Name'))

# Do not change this line
print(f"There are {num_unique_cities} unique cities that appear within our dataset.")
There are 210 unique cities that appear within our dataset.
In [7]:
grader.check("q1_1_3")
Out[7]:

q1_1_3
passed! 🎉

In order to investigate further, it will be helpful to determine what region of the United States each city was located in: Northeast, Northwest, Southeast, or Southwest. For our purposes, we will be using the following geographical boundaries:

USA Coordinate Map

  1. A station is located in the "Northeast" region if its latitude is above or equal to 40 degrees and its longtitude is greater than or equal to -100 degrees.
  2. A station is located in the "Northwest" region if its latitude is above or equal to 40 degrees and its longtitude is less than -100 degrees.
  3. A station is located in the "Southeast" region if its latitude is below 40 degrees and its longtitude is greater than or equal to -100 degrees.
  4. A station is located in the "Southwest" region if its latitude is below 40 degrees and its longtitude is less than -100 degrees.

Question 1.1.4: Define the coordinates_to_region function below. It should take in two arguments, a city's latitude (lat) and longitude (lon) coordinates, and output a string representing the region it is located in.

In [8]:
def coordinates_to_region(lat,lon):
    if lat>=40 and lon>=-100:
        return "Northeast"
    elif lat>=40 and lon<-100:
        return "Northwest"
    elif lat<40 and lon>=-100:
        return "Southeast"
    else:
        return "Southwest"
In [9]:
grader.check("q1_1_4")
Out[9]:

q1_1_4
passed! 🚀

Question 1.1.5: Add a new column in cities labeled Region that contains the region in which the city is located.

Hint: Use the apply function together with the coordinates_to_region function you just defined.

In [10]:
regions_array = cities.apply(coordinates_to_region,'Lat','Lon')
cities = cities.with_column('Region',regions_array)
cities.show(5)
Name ID Lat Lon Stn.Name Stn.stDate Stn.edDate Region
Lander USW00024021 42.8153 -108.726 LANDER WBO 1892-01-01 1946-05-28 Northwest
Lander USW00024021 42.8153 -108.726 LANDER HUNT FIELD 1946-05-29 2021-12-31 Northwest
Cheyenne USW00024018 41.1519 -104.806 CHEYENNE WBO 1871-01-01 1935-08-31 Northwest
Cheyenne USW00024018 41.1519 -104.806 CHEYENNE MUNICIPAL ARPT 1935-09-01 2021-12-31 Northwest
Wausau USW00014897 44.9258 -89.6256 Wausau Record Herald 1896-01-01 1941-12-31 Northeast

... (456 rows omitted)

In [11]:
grader.check("q1_1_5")
Out[11]:

q1_1_5
passed! 💯

To confirm that you've defined your coordinates_to_region function correctly and successfully added the Region column to the cities table, run the following cell. Each region should have a different color in the result.

In [12]:
# Just run this cell
cities.scatter("Lon", "Lat", group="Region")
No description has been provided for this image

Part 1, Section 2: Welcome to Phoenix, Arizona¶

Each city has a different CSV file full of daily temperature and precipitation measurements. The file for Phoenix, Arizona is included with this project as phoenix.csv. The files for other cities can be downloaded here by matching them to the ID of the city in the cities table.

Since Phoenix is located on the upper edge of the Sonoran Desert, it has some impressive temperatures.

Run the following cell to load in the phoenix table. It has one row per day and the following columns:

  1. "Date": The date (a string) representing the date of the recording in YYYY-MM-DD format
  2. "tmax": The maximum temperature for the day (°F)
  3. "tmin": The minimum temperature for the day (°F)
  4. "prcp": The recorded precipitation for the day (inches)
In [13]:
phoenix = Table.read_table("phoenix.csv", index_col=0)
phoenix.show(3)
Date tmax tmin prcp
1896-01-01 66 30 0
1896-01-02 64 30 0
1896-01-03 68 30 0

... (46018 rows omitted)

Question 1.2.1: Assign the variable largest_2010_range_date to the date of the largest temperature range in Phoenix, Arizona for any day between January 1st, 2010 and December 31st, 2010. To get started, use the variable phoenix_with_ranges_2010 to filter the phoenix table to days in 2010 with an additional column corresponding to the temperature range for that day.

Your answer should be a string in the "YYYY-MM-DD" format. Feel free to use as many lines as you need. A temperature range is calculated as the difference between the max and min temperatures for the day.

Hint: To limit the values in a column to only those that contain a certain string, pick the right are. predicate from the Python Reference Sheet.

Note: Do not re-assign the phoenix variable; please use the phoenix_with_ranges_2010 variable instead.

In [14]:
def temp_range(max,min):
    return abs(max-min)
phoenixtemprange2010=phoenix.where('Date',are.containing('2010')).apply(temp_range,'tmax','tmin')
phoenix_with_ranges_2010 = phoenix.where('Date',are.containing('2010')).with_column('Temperature Range',phoenixtemprange2010)
phoenix_with_ranges_2010
largest_2010_range_date = (phoenix_with_ranges_2010.where("Temperature Range",max(phoenix_with_ranges_2010.column('Temperature Range'))).column('Date').item(0))
largest_2010_range_date
Out[14]:
'2010-06-24'
In [15]:
grader.check("q1_2_1")
Out[15]:

q1_2_1
passed! 🍀

We can look back to our phoenix table to check the temperature readings for our largest_2010_range_date to see if anything special is going on. Run the cell below to find the row of the phoenix table that corresponds to the date we found above.

In [16]:
# Just run this cell
phoenix.where("Date", largest_2010_range_date)
Out[16]:
Date tmax tmin prcp
2010-06-24 113 79 0

The function extract_year_from_date takes a date string in the YYYY-MM-DD format and returns an integer representing the year. The function extract_month_from_date takes a date string and returns a string describing the month. Run this cell, but you do not need to understand how this code works or edit it.

In [17]:
# Just run this cell
import calendar

def extract_year_from_date(date):
    """Returns an integer corresponding to the year of the input string's date."""
    return int(date[:4])

def extract_month_from_date(date):
    "Return an abbreviation of the name of the month for a string's date."
    month = date[5:7]
    return f'{month} ({calendar.month_abbr[int(date[5:7])]})'


# Example
print('2022-04-01 has year', extract_year_from_date('2022-04-01'),
      'and month', extract_month_from_date('2022-04-01'))
2022-04-01 has year 2022 and month 04 (Apr)

Question 1.2.2: Add two new columns called Year and Month to the phoenix table that contain the year as an integer and the month as a string (such as "04 (Apr)") for each day, respectively.

Note: The functions above may be helpful!

In [18]:
years_array = phoenix.apply(extract_year_from_date,'Date')
months_array = phoenix.apply(extract_month_from_date,'Date')
phoenix=phoenix.with_columns('Year',years_array,'Month',months_array)
phoenix.show(5)
Date tmax tmin prcp Year Month
1896-01-01 66 30 0 1896 01 (Jan)
1896-01-02 64 30 0 1896 01 (Jan)
1896-01-03 68 30 0 1896 01 (Jan)
1896-01-04 69 34 0 1896 01 (Jan)
1896-01-05 70 46 0 1896 01 (Jan)

... (46016 rows omitted)

In [19]:
grader.check("q1_2_2")
Out[19]:

q1_2_2
passed! 🌟

Question 1.2.3: Using the phoenix table, create an overlaid line plot of the average maximum temperature and average minimum temperature for each year between 1900 and 2020 (inclusive).

Hint: The `group' function will be helpful in computing the averages. It can take an optional argument allowing you to apply a function to the group.

Hint: To draw a line plot with more than one line, call plot on the column label of the x-axis values and enter the y-axis laels in square brackets like [col1_name, col2_name].

In [20]:
avgphoenix=phoenix.group(['Year'],np.average).drop('Date average','Month average','prcp average').where('Year',are.between_or_equal_to(1900,2020))
avgphoenix.plot('Year',['tmax average','tmin average'])
No description has been provided for this image

Question 1.2.4: Although still hotly debated (pun intended), many climate scientists agree that the effects of climate change began to surface in the early 1960s as a result of elevated levels of greenhouse gas emissions. How does the graph you produced in Question 1.2.3 support the claim that modern-day global warming began in the early 1960s?

Focusing on just the state of Arizona, it appears the minimum and maximum temperatures start increasing after 1960, rather than being nominal. It would also appear the minimum is increasing a lot faster than the maximum is.

Averaging temperatures across an entire year can obscure some effects of climate change. For example, if summers get hotter but winters get colder, the annual average may not change much. Let's investigate how average monthly maximum temperatures have changed over time in Phoenix.

Question 1.2.5: Create a monthly_increases table with one row per month and the following four columns in order:

  1. "Month": The month (such as "02 (Feb)")
  2. "Past": The average max temperature in that month from 1900-1960 (both ends inclusive)
  3. "Present": The average max temperature in that month from 2019-2021 (both ends inclusive)
  4. "Increase": The difference between the present and past average max temperatures in that month

First, make a copy of the phoenix table and add a new column containing the corresponding period for each row. The period refers to whether the year is in the "Past", "Present", or "Other" category. You may find the period function helpful to see which years correspond to each period. Then, use this new table to construct monthly_increases. Feel free to use as many lines as you need.

Hint: You can get the first three columns in one line by constructing a pivot table with the appropriate inputs for collect and values. Compute the increase column after that.

Note: Please do not re-assign the phoenix variable!

In [21]:
def period(year):
    "Output if a year is in the Past, Present, or Other."
    if 1900 <= year <= 1960:
        return "Past"
    elif 2019 <= year <= 2021:
        return "Present"
    else:
        return "Other"
    
pperiod=phoenix.apply(period,'Year')
phoenix2=phoenix.with_column('Period',pperiod).where('Period',are.not_equal_to('Other'))
phoenixpivot=phoenix2.pivot('Period','Month','tmax',np.average)
increase=phoenixpivot.column('Present')-phoenixpivot.column('Past')
monthly_increases =phoenixpivot.with_column('Increase',increase)
monthly_increases.show()
Month Past Present Increase
01 (Jan) 65.0164 67.8312 2.81479
02 (Feb) 68.8485 69.1859 0.337362
03 (Mar) 74.6499 75.9796 1.32965
04 (Apr) 82.6421 88.4 5.75792
05 (May) 91.4299 94.571 3.14104
06 (Jun) 101.166 105.734 4.56832
07 (Jul) 103.599 107.245 3.64654
08 (Aug) 101.416 107.384 5.96769
09 (Sep) 97.6874 101.238 3.55035
10 (Oct) 86.798 90.1667 3.36868
11 (Nov) 74.6273 80.5178 5.89046
12 (Dec) 65.9064 67.4548 1.54844
In [22]:
grader.check("q1_2_5")
Out[22]:

q1_2_5
passed! 🌈

February in Phoenix¶

The "Past" column values are averaged over many decades, and so they are reliable estimates of the average high temperatures in those months before the effects of modern climate change. However, the "Present" column is based on only three years of observations. February, the shortest month, has the fewest total observations: only 85 days. Run the following cell to see this.

In [23]:
# Just run this cell
feb_present = phoenix.where('Year', are.between_or_equal_to(2019, 2021)).where('Month', '02 (Feb)')
feb_present.num_rows
Out[23]:
85

Look back to your monthly_increases table. Compared to the other months, the increase for the month of February is quite small; the February difference is very close to zero. Run the following cell to print out our observed difference.

In [24]:
# Just run this cell
print(f"February Difference: {monthly_increases.row(1).item('Increase')}")
February Difference: 0.3373623297258632

Perhaps that small difference is somehow due to chance! To investigate this idea requires a thought experiment.

We can observe all of the February maximum temperatures from 2019 to 2021 (the present period), so we have access to the census; there's no random sampling involved. But, we can imagine that if more years pass with the same present-day climate, there would be different but similar maximum temperatures in future February days. From the data we observe, we can try to estimate the average maximum February temperature in this imaginary collection of all future February days that would occur in our modern climate, assuming the climate doesn't change any further and many years pass.

We can also imagine that the maximum temperature each day is like a random draw from a distribution of max daily temperatures for that month. Treating actual observations of natural events as if they were each randomly sampled from some unknown distribution is a simplifying assumption. These temperatures were not actually sampled at random—instead they occurred due to the complex interactions of the Earth's climate—but treating them as if they were random abstracts away the details of this naturally occuring process and allows us to carry out statistical inference. Conclusions are only as valid as the assumptions upon which they rest, but in this case thinking of daily temperatures as random samples from some unknown climate distribution seems at least plausible.

If we assume that the actual temperatures were drawn at random from some large population of possible February days in our modern climate, then we can not only estimate the population average of this distribution, but also quantify our uncertainty about that estimate using a confidence interval.

We will now compute the confidence interval of the present February average max daily temperature. To unpack this statement, we are saying that this confidence interval is looking at present-day February conditions (think about which table is relevant to this), and the confidence interval is for present-day February average max daily temperatures. We will compare this confidence interval to the historical average (ie. the "Past" value in our monthly_increases table). How will we do the comparison? Well, since we are essentially interested in seeing if the average February max daily temperatures have changed since the past, we care about whether the historical average lies within the confidence interval we create.

Based on the information above, think what the null hypothesis and alternative hypothesis are.

Question 1.2.6. Complete the implementation of the function make_ci, which takes a one-column table tbl containing sample observations and a confidence level percentage such as 95 or 99. It returns an array containing the lower and upper bound in that order, of a confidence interval for the population mean constructed using 5,000 bootstrap resamples.

After defining make_ci, we have provided a line of code that calls make_ci on the present-day February max temperatures to output the 99% confidence interval for the average of daily max temperatures in February. The result should be around 67 degrees for the lower bound and around 71 degrees for the upper bound of the interval.

In [25]:
def make_ci(tbl, level):
    """Compute a level% confidence interval of the average of the population for 
    which column 0 of Table tbl contains a sample. This function should return an 
    array of the lower and upper bounds of the confidence interval.
    """
    stats = make_array()
    for k in np.arange(5000):
        resample=tbl.sample()
        stat = np.average(resample.column(0))
        stats = stats=np.append(stats,stat)
    lowernum=(100-level)/2
    uppernum=100-lowernum
    lower_bound = percentile(lowernum,stats)
    upper_bound = percentile(uppernum,stats)
    return make_array(lower_bound,upper_bound)

# Call make_ci on the max temperatures in present-day February to find the lower and upper bound of a 99% confidence interval.
feb_present_ci = make_ci(feb_present.select('tmax'), 99)
feb_present_ci
Out[25]:
array([ 67.08      ,  71.36352941])
In [26]:
grader.check("q1_2_6")
Out[26]:

q1_2_6
passed! 🙌

Question 1.2.7 The feb_present_ci 99% confidence interval contains the observed past February average maximum temperature of 68.8485 (from the monthly_increases table). What conclusion can you draw about the effect of climate change on February maximum temperatures in Phoenix from this information? Use a 1% p-value cutoff.

Note: If you're stuck on this question, re-reading the paragraphs under the February heading (particularly the first few) may be helpful.

We fail to reject the null hypothesis, so the increase is possibly from random sampling.

All Months¶

Question 1.2.8. Repeat the process of seeing whether the past average for each month is contained within the 99% confidence interval of the present average. Just as a note for the context, remember that these “averages” are averages of the max daily temperatures within those time periods. Code has already been written to print out the results in the format of a month (e.g., 02 (Feb)), the observed past average, the confidence interval for the present average, and whether the past average was contained in the interval or not.

Use the provided call to print in order to format the result as one line per month.

Hint: Your code should follow the same format as our code from above (i.e. the February section).

In [27]:
monthly_increases.where('Month','02 (Feb)').column('Past')
Out[27]:
array([ 68.84852002])
In [28]:
comparisons = make_array()
months = monthly_increases.column('Month')
for month in months:
    past_average= monthly_increases.where('Month',month).column('Past').item(0)
    present_observations = phoenix.where('Year', are.between_or_equal_to(2019, 2021)).where('Month', month)
    present_ci = make_ci(present_observations.select('tmax'), 99)
    
    # Do not change the code below this line
    GREEN_BOLD = '\033[1;32m'
    RED_BOLD = '\033[1;31m'
    RESET_STYLE = '\033[0m'
    within = (past_average >= present_ci.item(0)) and (past_average <= present_ci.item(1))
    if within:
        comparison = f'{GREEN_BOLD}contained{RESET_STYLE}'
    else:
        comparison = f'{RED_BOLD}NOT contained{RESET_STYLE}'
    comparisons = np.append(comparisons, comparison)
    
    print('For', month, 'the past avg', round(past_average, 1),
    'is', comparison,
    'in the interval', np.round(present_ci, 1),
    ', the 99% CI of the present avg. \n')
For 01 (Jan) the past avg 65.0 is NOT contained in the interval [ 66.4  69.2] , the 99% CI of the present avg. 

For 02 (Feb) the past avg 68.8 is contained in the interval [ 66.9  71.2] , the 99% CI of the present avg. 

For 03 (Mar) the past avg 74.6 is contained in the interval [ 74.   77.9] , the 99% CI of the present avg. 

For 04 (Apr) the past avg 82.6 is NOT contained in the interval [ 86.4  90.4] , the 99% CI of the present avg. 

For 05 (May) the past avg 91.4 is NOT contained in the interval [ 92.4  96.6] , the 99% CI of the present avg. 

For 06 (Jun) the past avg 101.2 is NOT contained in the interval [ 104.3  107.2] , the 99% CI of the present avg. 

For 07 (Jul) the past avg 103.6 is NOT contained in the interval [ 105.5  108.8] , the 99% CI of the present avg. 

For 08 (Aug) the past avg 101.4 is NOT contained in the interval [ 105.8  108.9] , the 99% CI of the present avg. 

For 09 (Sep) the past avg 97.7 is NOT contained in the interval [  99.2  103.1] , the 99% CI of the present avg. 

For 10 (Oct) the past avg 86.8 is NOT contained in the interval [ 87.8  92.4] , the 99% CI of the present avg. 

For 11 (Nov) the past avg 74.6 is NOT contained in the interval [ 78.1  82.9] , the 99% CI of the present avg. 

For 12 (Dec) the past avg 65.9 is contained in the interval [ 65.7  69.2] , the 99% CI of the present avg. 

In [29]:
grader.check("q1_2_8")
Out[29]:

q1_2_8
passed! 🙌

Question 1.2.9. Summarize your findings. After checking whether the past average (of max temperatures, as defined above 1.2.6) is contained in the 99% confidence interval for each month, what conclusions can we make about the monthly average maximum temperature in historical (1900-1960) vs. modern (2019-2021) times in the twelve months? In other words, what null hypothesis should you consider, and for which months would you reject or fail to reject the null hypothesis? Use a 1% p-value cutoff.

Hint: Do you notice any seasonal patterns?

Considering the null hypothesis: The increase in temperature maximum was caused by random sampling. Every month we reject the null hypothesis, except for the winter months (minus January, the coldest month) and March.

It is important to be wary of making conclusions based on the results of multiple hypothesis tests. Specifically, recall that a 1% p-value cutoff means that if the null is true, we would expect to incorrectly reject the null 1% of the time. Now, if we run multiple of these tests, the chance we observe an error will be greater than if we had just run one test – the error chance compounds.

Part 2: Drought¶

According to the United States Environmental Protection Agency, "Large portions of the Southwest have experienced drought conditions since weekly Drought Monitor records began in 2000. For extended periods from 2002 to 2005 and from 2012 to 2020, nearly the entire region was abnormally dry or even drier."

Assessing the impact of drought is challenging with just city-level data because so much of the water that people use is transported from elsewhere, but we'll explore the data we have and see what we can learn.

Let's first take a look at the precipitation data in the Southwest region. The southwest.csv file contains total annual precipitation for 13 cities in the southwestern United States for each year from 1960 to 2021. This dataset is aggregated from the daily data and includes only the Southwest cities from the original dataset that have consistent precipitation records back to 1960.

In [30]:
southwest = Table.read_table('southwest.csv')
southwest.show(5)
City Year Total Precipitation
Albuquerque 1960 8.12
Albuquerque 1961 8.87
Albuquerque 1962 5.39
Albuquerque 1963 7.47
Albuquerque 1964 7.44

... (788 rows omitted)

Question 2.1. Create a table totals that has one row for each year in chronological order. It should contain the following columns:

  1. "Year": The year (a number)
  2. "Precipitation": The total precipitation in all 13 southwestern cities that year
In [31]:
totals = southwest.group('Year',np.sum).drop('City sum').relabeled('Total Precipitation sum','Precipitation')
totals
Out[31]:
Year Precipitation
1960 149.58
1961 134.82
1962 130.41
1963 132.18
1964 123.41
1965 187.53
1966 120.27
1967 179.02
1968 136.25
1969 191.72

... (51 rows omitted)

In [32]:
grader.check("q2_1")
Out[32]:

q2_1
passed! 🍀

Run the cell below to plot the total precipitation in these cities over time, so that we can try to spot the drought visually. As a reminder, the drought years given by the EPA were (2002-2005) and (2012-2020).

In [33]:
# Just run this cell
totals.plot("Year", "Precipitation")
No description has been provided for this image

This plot isn't very revealing. Each year has a different amount of precipitation, and there is quite a bit of variability across years, as if each year's precipitation is a random draw from a distribution of possible outcomes.

Could it be that these so-called "drought conditions" from 2002-2005 and 2012-2020 can be explained by chance? In other words, could it be that the annual precipitation amounts in the Southwest for these drought years are like random draws from the same underlying distribution as for other years? Perhaps nothing about the Earth's precipitation patterns has really changed, and the Southwest U.S. just happened to experience a few dry years close together.

To assess this idea, let's conduct an A/B test in which each year's total precipitation is an outcome, and the condition is whether or not the year is in the EPA's drought period.

This drought_label function distinguishes between drought years as described in the U.S. EPA statement above (2002-2005 and 2012-2020) and other years. Note that the label "other" is perhaps misleading, since there were other droughts before 2000, such as the massive 1988 drought that affected much of the U.S. However, if we're interested in whether these modern drought periods (2002-2005 and 2012-2020) are normal or abnormal, it makes sense to distinguish the years in this way.

In [34]:
def drought_label(n):
    """Return the label for an input year n."""
    if 2002 <= n <= 2005 or 2012 <= n <= 2020:
        return 'drought'
    else:
        return 'other'

Question 2.2. Define null and alternative hypotheses for an A/B test that investigates whether drought years are drier (have less precipitation) than other years.

Note: Please format your answer using the following structure.

  • Null hypothesis: ...
  • Alternative hypothesis: ...
  • Null Hypothesis: Drought years have no effect on precipitation.
  • Alternative hypothesis: Drought years have an effect on precipitation.

Question 2.3. First, define the table drought. It should contain one row per year and the following two columns:

  • "Label": Denotes if a year is part of a "drought" year or an "other" year
  • "Precipitation": The sum of the total precipitation in 13 Southwest cities that year

Then, construct an overlaid histogram of two observed distributions: the total precipitation in drought years and the total precipitation in other years.

Note: Use the provided bins when creating your histogram, and do not re-assign the southwest table. Feel free to use as many lines as you need!

Hint: The optional group argument in a certain function might be helpful!

In [159]:
bins = np.arange(85, 215+1, 13)
drought = totals.group('Precipitation',drought_label).relabeled('Year drought_label', 'Label')
drought.hist('Precipitation',group='Label')
No description has been provided for this image

Before you continue, inspect the histogram you just created and try to guess the conclusion of the A/B test. Building intuition about the result of hypothesis testing from visualizations is quite useful for data science applications.

Question 2.4. Our next step is to choose a test statistic based on our alternative hypothesis in Question 2.2. Which of the following options are valid choices for the test statistic? Assign ab_test_stat to an array of integers corresponding to valid choices. Assume averages and totals are taken over the total precipitation sums for each year.

  1. The difference between the total precipitation in drought years and the total precipitation in other years.
  2. The difference between the total precipitation in others years and the total precipitation in drought years.
  3. The absolute difference between the total precipitation in others years and the total precipitation in drought years.
  4. The difference between the average precipitation in drought years and the average precipitation in other years.
  5. The difference between the average precipitation in others years and the average precipitation in drought years.
  6. The absolute difference between the average precipitation in others years and the average precipitation in drought years.
In [90]:
ab_test_stat = np.array([4,6])
In [91]:
grader.check("q2_4")
Out[91]:

q2_4
passed! 🙌

Question 2.5. Fellow climate scientists Noah and Sarah point out that there are more other years than drought years, and so measuring the difference between total precipitation will always favor the other years. They conclude that all of the options above involving total precipitation are invalid test statistic choices. Do you agree with them? Why or why not?

Hint: Think about how permutation tests work with imbalanced classes!

No, because it's talking about the average amount of precipitation per year so it isn't imbalanced.

Before going on, check your drought table. It should have two columns Label and Precipitation with 61 rows, 13 of which are for "drought" years.

In [38]:
drought.show(3)
Precipitation Label
89.21 drought
98.18 drought
98.7 drought

... (58 rows omitted)

In [39]:
drought.group('Label')
Out[39]:
Label count
drought 13
other 48

Question 2.6. For our A/B test, we'll use the difference between the average precipitation in drought years and the average precipitation in other years as our test statistic:

$$\text{average precipitation in "drought" years} - \text{average precipitation in "other" years}$$

First, complete the function test_statistic. It should take in a two-column table t with one row per year and two columns:

  • Label: the label for that year (either 'drought' or 'other')
  • Precipitation: the total precipitation in the 13 Southwest cities that year.

Then, use the function you define to assign observed_statistic to the observed test statistic.

In [155]:
def test_statistic(t):
    table=t.group('Label',np.average).column(1)
    return table.item(0)-table.item(1)

observed_statistic = test_statistic(drought)
observed_statistic
Out[155]:
-15.856714743589748
In [109]:
grader.check("q2_6")
Out[109]:

q2_6
passed! 🌈

Now that we have defined our hypotheses and test statistic, we are ready to conduct our hypothesis test. We’ll start by defining a function to simulate the test statistic under the null hypothesis, and then call that function 5,000 times to construct an empirical distribution under the null hypothesis.

Question 2.7. Write a function to simulate the test statistic under the null hypothesis. The simulate_precipitation_null function should simulate the null hypothesis once (not 5,000 times) and return the value of the test statistic for that simulated sample.

Hint: Using t.with_column(...) with a column name that already exists in a table t will replace that column with the newly specified values.

In [140]:
def simulate_precipitation_null():
    shuffled_labels=drought.sample(with_replacement=False).column('Label')
    shuffled_table=drought.select('Precipitation').with_column('Label',shuffled_labels)
    return test_statistic(shuffled_table)


# Run your function a couple times to make sure that it works
simulate_precipitation_null()
Out[140]:
-10.865272435897452
In [141]:
grader.check("q2_7")
Out[141]:

q2_7
passed! 🙌

Question 2.8. Fill in the blanks below to complete the simulation for the hypothesis test. Your simulation should compute 5,000 values of the test statistic under the null hypothesis and store the result in the array sampled_stats.

Hint: You should use the simulate_precipitation_null function you wrote in the previous question!

Note: Running this cell may take a few seconds. If it takes more than a minute, try to find a different (faster) way to implement your simulate_precipitation_null function.

In [142]:
sampled_stats = make_array()

repetitions = 5000
for i in np.arange(repetitions):
    diff=simulate_precipitation_null()
    sampled_stats= np.append(sampled_stats,diff)

# Do not change these lines
Table().with_column('Difference Between Means', sampled_stats).hist()
plt.scatter(observed_statistic, 0, c="r", s=50);
plt.ylim(-0.01);
No description has been provided for this image
In [143]:
grader.check("q2_8")
Out[143]:

q2_8
passed! 🚀

Question 2.9. Compute the p-value for this hypothesis test, and assign it to the variable precipitation_p_val.

In [157]:
precipitation_p_val = np.count_nonzero(sampled_stats <= observed_statistic) / repetitions
precipitation_p_val
Out[157]:
0.0248
In [149]:
grader.check("q2_9")
Out[149]:

q2_9
passed! 🙌

Question 2.10. State a conclusion from this test using a p-value cutoff of 5%. What have you learned about the EPA's statement on drought?

Since the p value is less than 0.05, the data is considered statistically significant, so we reject the null.

Question 2.11. Does your conclusion from Question 2.10 apply to the entire Southwest region of the U.S.? Why or why not?

Note: Feel free to do some research into geographical features of this region of the U.S.!

Probably not, because the Southwest region consists of a lot of different geographical features which each take a role in precipitation of that area.

Conclusion¶

Data science plays a central role in climate change research because massive simulations of the Earth's climate are necessary to assess the implications of climate data recorded from weather stations, satellites, and other sensors. Berkeley Earth is a common source of data for these kinds of projects.

In this project, we found ways to apply our statistical inference technqiues that rely on random sampling even in situations where the data were not generated randomly, but instead by some complicated natural process that appeared random. We made assumptions about randomness and then came to conclusions based on those assumptions. Great care must be taken to choose assumptions that are realistic, so that the resulting conclusions are not misleading. However, making assumptions about data can be productive when doing so allows inference techniques to apply to novel situations.